Summary. The Coronaviridae family, comprising the Coronavirus and Torovirus
genera, is part of the Nidovirales order that also includes two other families, 
Arteriviridae and Roniviridae. Based on genetic and serological relationships, 
groups 1, 2 and 3 were previously recognized in the Coronavirus genus. In 
this report we present results of comparative sequence analysis of the spike 
(S), envelope (E), membrane (M), and nucleoprotein (N) structural proteins, 
and the two most conserved replicase domains, putative RNA-dependent RNA 
polymerase (RdRp) and RNA helicase (HEL), aimed at a revision of the Coro- 
naviridae taxonomy. The results of pairwise comparisons involving structural 
and replicase proteins of the Coronavirus genus were consistent and produced 
percentages of sequence identities that were distributed in discontinuous clusters. 
Inter-group pairwise scores formed a single cluster in the lowest percentile. No 
homologs of the N and E proteins have been found outside coronaviruses, and the 
only (very) distant homologs of S and M proteins were identified in toroviruses. 
Intragroup sequence conservation was higher, although for some pairs, especially
those from the most diverse group 1, scores were close to or even overlapped with
those from the intergroup comparisons. Phylogenetic analysis of six proteins using
a neighbor-joining algorithm confirmed three coronavirus groups. Comparative
sequence analysis of RdRp and HEL domains was extended to include arterivirus
and ronivirus homologs. The pairwise scores between sequences of the genera
Coronavirus and Torovirus (22-25% and 21-25%) were found to be very close
to or overlapped with the value ranges (12 to 22% and 17 to 25%) obtained 
for interfamily pairwise comparisons, but were much smaller than values derived 
from pairwise comparisons within the Coronavirus genus (63-71% and 59-67%). 
Phylogenetic analysis confirmed toroviruses and coronaviruses to be separated by 
a large distance that is comparable to those between established nidovirus families. 
Based on comparison of these scores with those derived from analysis of separate 
ranks of several multi-genera virus families, like the Picornaviridae, a revision 
of the Coronaviridae taxonomy is proposed. We suggest that the Coronavirus and
Torovirus genera be redefined as two subfamilies within the Coronaviridae or
two families within the Nidovirales, and that the current three informal coronavirus
groups be converted into three genera within the Coronaviridae.


Introduction 


The current virus taxonomy universally uses the order, family, genus and species 
ranks to organize all diversity of viruses within a hierarchical system [48, 79]. To 
better reflect an outstanding complexity of similarities found in some virus groups, 
a subfamily rank is also occasionally used. Viruses are assigned to a particular 
taxonomic position according to results of comparative analysis of selected prop- 
erties, characterizing different aspects of the genome and virion structures and the 
replication strategy of viruses. There is no hierarchy in the property list and most 
of the features used are not quantitative. Nevertheless, analysis of genomic data 
has de facto played an increasing role in past taxonomy revisions.

The resolving power of comparative sequence analysis was clearly demonstrated
in a study of the virus capsid gene sequences of the Potyviridae family, when
diverse strains, species and genera were separated in distinct clusters according 
to pairwise sequence scores [83]. In another highly illustrative case, the results of 
comparative sequence analysis of replicative proteins [4, 41] were most vital for 
a decision to expel Hepatitis E virus from the Caliciviridae family, where it was 
originally placed using other non-sequence properties [5]. Results of comparative 
sequence analysis were also instrumental for the creation of the Arteriviridae 
family [17] and subsequent placement of this and Coronaviridae families into 
a newly designed Nidovirales order currently including also Roniviridae, all of 
which are morphologically different [13, 22, 45]. The experience proved that con- 
served sequence patterns common for this order are more reliable characteristics 
than other properties, including the spliced structural organization of subgenomic 
RNAs that was originally considered a hallmark of the Nidovirales [16, 26, 34, 
46, 53, 59, 66, 71, 74, 84]. 

This study focuses on coronavirus taxonomy. These viruses use single-stranded 
positive-sense RNA genomes of between 28 and 32kb that are packaged in 
enveloped virions with corona- or toro-like morphology [21]. The coronavirus 
genome includes multiple open reading frames (ORFs), with a large replicase 
being encoded in the two 5’-most and overlapping ORFs and the structural and 
auxiliary proteins being expressed from the downstream four or more ORFs. The 
replicase components are autoproteolytically derived from two polyproteins, one
of which is produced through ribosomal frameshifting during virion RNA translation [7,
90]. The backbone of the replicase polyproteins includes several uniquely arranged
conserved domains, two of which have not been found outside the Nidovirales 
order [16, 26, 71]. The non-replicase ORFs are expressed from a 5’- and 3’- 
coterminal nested set of subgenomic viral mRNAs [22, 44]. The Coronaviridae 
family is formed by the genera Coronavirus and Torovirus [21]. 

Using genetic and antigenic criteria, virus species in the genus Coronavirus 
have been organized into groups 1, 2 and 3 [21]. Group 1 includes porcine Transmissible
gastroenteritis virus (TGEV), Feline coronavirus (FCoV), Canine coronavirus
(CCoV), Human coronavirus 229E (HCoV-229E) and Porcine epidemic
diarrhea virus (PEDV). Group 2 members are Murine hepatitis virus (MHV), 
Bovine coronavirus (BCoV), Human coronavirus OC43 (HCoV-OC43), Porcine 
hemagglutinating encephalomyelitis virus (HEV), Rat coronavirus (RtCoV), and 
Equine coronavirus (ECoV). Group 3 is formed by avian Infectious bronchitis 
virus (IBV), Turkey coronavirus (TCoV), and Pheasant coronavirus [10]. The 
current distribution of species into groups 1 to 3 agrees with previously performed
phylogenetic analyses [11, 31, 67, 76], although the status of groups within 
a genus is rather provisional and does not correspond to a proper taxonomic 
category. 

Toroviruses were originally proposed to form a new family separated from 
coronaviruses [35]. However, subsequent comparative data analyses led to their
recognition as a genus within the Coronaviridae [9, 57]. Two torovirus species, 
Bovine torovirus (BToV), originally named Breda virus, and Equine torovirus 
(EToV), have been recognized, although toroviruses may also infect other mammals,
including humans and swine [20, 42, 55, 71, 81]. EToV is so far the
only torovirus that has been propagated in tissue culture and molecularly characterized
[71, 81], although partial genome sequences have also been determined
for toroviruses infecting other species [20, 42]. 

Given the rapidly accumulating data on the genome structure, expression, and
virus architecture of coronaviruses and other nidoviruses, it seems appropriate to
bring the taxonomic classification of the Coronaviridae family up to date. In this
study we performed a systematic quantitative analysis of sequence conservation
among four structural proteins of the Coronaviridae and two key replicase enzymes,
the putative RNA-dependent RNA polymerase (RdRp) and helicase (HEL) of
the Nidovirales. The results were correlated with non-sequence characteristics
and rationalized using criteria that were derived from analysis of other virus
families. Our analysis suggests that the Coronavirus and Torovirus genera should
be redefined as two subfamilies within the Coronaviridae or two families within
the Nidovirales, and that the current three informal coronavirus groups should be
converted into three genera within the Coronaviridae.


Materials and methods 


Comparative sequence analyses 


Database searches were done using the BLAST program [1] available through the
WU-BLAST2 server [87]. Amino acid sequences were obtained from the SWISS-PROT/TrEMBL
[50] and PIR [86] databases. For the structural proteins, only full-length sequences were
included in the analysis. For the replicase domains, the sequences including the conserved 
motifs of the RdRp [40] and HEL [27, 38] that corresponded to fragments 513-820 and 1218- 
1512, respectively, of MHV ORF 1b (accession number P16342) were used. In total, 73 S, 
44 E, 57 M, 66 N, 19 RdRp and 14 HEL sequences were analyzed; they are listed in the
respective figures. Note that for each protein a unique set of sequences was analyzed and the
protein-specific sets overlapped to different extents.

Sequences were aligned with the CLUSTAL X program v. 1.82 [77] and the alignments 
were curated with T-COFFEE v. 1.32, which combines local and global multiple alignments and
yields more accurate sequence alignments than other available methods [49]. Some alignments 
were verified using the MACAW program [62] and were manually adjusted. 

The statistical significance of the similarity between the sequences included in the multiple 
alignments was verified applying the PSI-BLAST [2], LAMA [56], and MACAW [62] 
programs. 

The PSI-BLAST program mediates iterative searches that start with a query and involve 
building a position-specific scoring matrix from sequences similar to the query to be used as 
input to the next round of searching. The search continues, and the alignment expands with
new sequences, until the results converge, that is, until no new hits above a statistically
significant threshold are recorded. In this study, every sequence to be compared was used as
a query in iterative PSI-BLAST searches against the non-redundant (nr) peptide sequence
database with an inclusion E threshold of 0.05. This value indicates that the threshold
similarity may be observed by chance once per search of a database 20 times as large as the
one that was actually searched. We considered similarities among all sequences in a group to
be statistically significant if the outputs of searches that were initiated with every group
sequence formed a continuous network of matches.
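
The "continuous network of matches" criterion can be read as a connectivity test on a graph whose nodes are the group sequences and whose edges are the recorded matches. The following is a minimal sketch under that reading, not the procedure actually used in this study: the function name, the input format (a mapping from each query to the set of group members it retrieved above the inclusion threshold) and the treatment of a match as undirected are all assumptions made for illustration.

```python
from collections import deque


def forms_continuous_network(hits: dict[str, set[str]]) -> bool:
    """Return True if all sequences are linked through chains of matches.

    `hits` is a hypothetical input format: it maps every query sequence ID
    to the IDs of the group members retrieved by the search started with
    that query. A match is treated as undirected (an assumption).
    """
    members = set(hits)
    if not members:
        return True

    # Build an undirected adjacency map restricted to group members.
    neighbours: dict[str, set[str]] = {m: set() for m in members}
    for query, found in hits.items():
        for target in found & members:
            if target != query:
                neighbours[query].add(target)
                neighbours[target].add(query)

    # Breadth-first search from an arbitrary member.
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)

    # The network is continuous only if every member was reached.
    return seen == members
```

Under this reading, a group passes the test even when some pairs of sequences never retrieve each other directly, provided that intermediate sequences link them.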

The most conserved regions in sequence alignments are known to form ungapped blocks
(ungapped local multiple alignments). Such blocks can be derived from multiple alignments
with the Block Maker [33] and used as a query in searches mediated by LAMA. Both
programs and other tools are run through the Blocks web server (http://www.blocks.fhcrc.org).
The LAMA program searches for statistically significant similarities between blocks of
an alignment and a blocks database derived from another alignment (a protein family) or
from all documented families of related proteins forming the Blocks database [32]. A hit
is considered relevant if its Z-score, the number of standard deviations between the block
alignment score and a mean score previously calculated for the entire Blocks database, is
above the cut-off of 5.6. In this study, the Block Maker was used to convert multiple
alignments containing groups of coronavirus or torovirus sequences into alignment block
databases. Then, LAMA performed inter-database comparisons in a block-versus-block
mode and also used the blocks of each alignment as a query to search the complete Blocks
database.
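
The Z-score criterion described above amounts to simple arithmetic. The sketch below only illustrates the definition given in the text (number of standard deviations above a database-wide mean, with a cut-off of 5.6); it does not reproduce how LAMA derives the mean and standard deviation, and the function and parameter names are hypothetical.

```python
Z_CUTOFF = 5.6  # cut-off stated in the text


def z_score(score: float, database_mean: float, database_std: float) -> float:
    """Number of standard deviations between a score and the database mean."""
    if database_std <= 0:
        raise ValueError("standard deviation must be positive")
    return (score - database_mean) / database_std


def is_relevant_hit(score: float, database_mean: float, database_std: float) -> bool:
    """A hit is kept only if its Z-score is above the cut-off."""
    return z_score(score, database_mean, database_std) > Z_CUTOFF
```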

To evaluate similarity between distantly related toro- and coronavirus sequences, the 
MACAW program was used. The MACAW program identifies conserved ungapped blocks in 
a group of sequences, assesses statistical significance of intra-block similarity and combines 
blocks in a multiple sequence alignment containing inter-block unaligned regions. To avoid 
distortion of the statistical calculations, closely related sequences must be excluded from the 
analysis. In this study, we used MACAW to align representatives of three coronavirus groups 
and a torovirus sequence. If the intra-block similarity of these sequences was statistically 
significant (probability of finding the same or higher score by chance was not more than 0.01) 
and the similarity became less significant after removal of any sequence from this alignment,
then the intra-block relationship of all aligned sequences was considered to be statistically 
significant.


Distance and phylogenetic analyses


The obtained alignments were used as input for the distance and phylogenetic analyses. 
Uncorrected distances for every pairwise sequence comparison (percentage of sequence 
identity) were calculated with DISTANCES from the GCG package (Womble, 2000). The 
calculated distances were further grouped into 2% intervals and the resulting counts were
plotted as frequency versus identity percentage histograms using Microsoft Excel 2001.
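
The calculation described above can be sketched as follows. This is an illustration only and not a reimplementation of the GCG DISTANCES program: it assumes that identity is counted over alignment columns in which both sequences have a residue (gapped columns are skipped), and that each score is assigned to the 2% interval given by its lower edge. Both conventions, and all names in the code, are assumptions.

```python
from collections import Counter
from itertools import combinations

GAP = "-"


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical residues between two aligned sequences.

    Columns with a gap in either sequence are ignored (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == GAP or res_b == GAP:
            continue
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no columns with residues in both sequences")
    return 100.0 * matches / compared


def identity_histogram(alignment: dict[str, str], bin_width: int = 2) -> dict[int, int]:
    """Count pairwise identities per interval, keyed by the interval's lower edge."""
    counts: Counter[int] = Counter()
    for (_, seq_a), (_, seq_b) in combinations(alignment.items(), 2):
        pid = percent_identity(seq_a, seq_b)
        # A score of exactly 100% is placed in the topmost interval.
        lower_edge = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))
```

For N aligned sequences the histogram summarizes N(N-1)/2 pairwise scores, which is why the larger protein sets give better-defined distributions.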

Dendrograms were computed by successively using four programs included in the PHYLIP 
package v.3.6a3 [25]. SEQBOOT generates resampled versions of an input data set, and it was 
used to create 1000 bootstrapped data sets from each alignment. Distance matrices summariz- 
ing pairwise comparisons within each one of the multiple alignment data sets were obtained 
with PROTDIST according to the Jones-Taylor-Thornton model of amino acid substitutions 
[37]. The distance matrices were fed to NEIGHBOR to compute the dendrograms by applying 
the Neighbor-joining method that constructs a tree by successive clustering of lineages 
[58]. Finally, from the multiple trees obtained for each original alignment, the majority rule 
consensus tree showing the bootstrap values in the nodes was calculated by CONSENSE. 
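
The resampling step can be illustrated with a short sketch. It shows the general idea of bootstrapping an alignment, namely drawing alignment columns with replacement so that every replicate has the original length, and is not the SEQBOOT code itself; the function names and the fixed random seed are assumptions introduced for the example.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Return one replicate built from columns sampled with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # The same sampled columns are used for every sequence, so that
    # each replicate column is a genuine column of the original alignment.
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(
    alignment: dict[str, str], n_replicates: int = 1000, seed: int = 0
) -> list[dict[str, str]]:
    """Generate `n_replicates` bootstrapped data sets from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

A distance matrix and a tree would then be computed for each replicate, and the bootstrap value of a node is the percentage of replicate trees in which the corresponding grouping appears.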

Alternatively, consensus unrooted Neighbor-joining dendrograms were obtained with 
CLUSTAL X v1.82 starting from 1000 bootstrapped replicates of each alignment. The phylo- 
genetic trees obtained with PHYLIP and CLUSTAL X had similar topologies. The CLUSTAL 
X dendrograms are shown in this article. 

For dendrograms containing S and M proteins, RdRp and HEL sequences, roots were 
inferred with the corresponding torovirus homologous sequences as outgroups. For this 
purpose, the statistical significance of the relationships between coronavirus and torovirus 
structural protein sequences was assessed (see below). 

The phylogenetic trees were plotted with the NJplot program [54] and the TreeView 
program v. 1.6.6 (Page, 1996) and manually edited. 


Results 


Generation of coronavirus-wide alignments of four structural 
proteins and two replicative domains 


To perform a comprehensive comparative sequence study of coronaviruses, the 
two most conserved domains, putative RdRp [29] and HEL [28, 63, 64], that are 
part of the replicase polyproteins, and the four structural proteins common to all 
coronaviruses (N, M, E, and S) have been selected. 

The PSI-BLAST-mediated searches retrieved all coronavirus N, M and S proteins
as separate groups that were subsequently aligned as described in Materials
and methods. Similar searches that were performed with E proteins produced
four different families, two for coronavirus group 1 and one for each of groups 2
and 3; these families are also listed in the protein family (PFAM) database [3]. To
check whether these protein families are related, we performed LAMA-assisted
cross-family comparisons using 2 or 3 ungapped blocks that were derived
from alignments of these protein families with the Block Maker tool. A network
of statistically significant interblock matches spanning all four families was detected;
every pair of families was linked except for the pair formed by families 2 and 3. These
data and the similar genetic positions support a common origin of the different E
proteins. Accordingly, the four group-specific E protein alignments were merged into
one coronavirus-wide alignment using the CLUSTAL X v. 1.82 and T-COFFEE programs.

The PSI-BLAST- and LAMA-mediated searches did not yield statistically
significant matches between the structural proteins of the two genera of the
Coronaviridae. Toroviruses have three structural proteins functionally equivalent
to the S, M and N proteins of coronaviruses. The S and M proteins are of similar
sizes, while the N protein is about 75% smaller in toroviruses. On these
biological grounds, we compared S and M proteins of corona- and toroviruses
using the MACAW program. Three statistically significant and colinear regions
were found in the C-terminal half of S proteins and two such regions were
delineated in M proteins. The C-terminal half is the most conserved part of the S
protein and is released as the S2 moiety by cleavage of the S protein. The identified
conserved regions enabled the generation of Coronaviridae-wide multiple
alignments of S and M proteins using the CLUSTAL X v. 1.82 and T-COFFEE programs.

The Coronaviridae-wide alignments of the most conserved regions of the 
RdRp and HEL were produced to include the characteristic motifs of these proteins 
[28, 29]. 


Three genetic groups are consistently evident upon analysis 
of pairwise distances of six proteins of the Coronavirus genus 


The all-inclusive alignments of six coronavirus proteins were used to produce 
respective matrices with percentages of the pairwise sequence identity. These 
matrices were further processed to derive and individually plot results for three 
coronavirus genetic groups and four inter-group combinations for each protein. 
Inspection of the 42 histograms obtained showed that the calculated identity 
percentages are not distributed continuously, but rather group in discrete clusters. 
Analysis of these distributions is given below. 
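
The bookkeeping behind these histograms (three single-group panels, three two-group panels and one all-group panel per protein, hence 7 x 6 = 42) can be sketched as below. The sketch assumes that pairwise scores are held in a mapping keyed by pairs of sequence identifiers and that every identifier has a group label; these data structures and names are hypothetical and are not taken from the original analysis.

```python
from itertools import combinations


def partition_scores(
    scores: dict[tuple[str, str], float], group_of: dict[str, str]
) -> dict[str, dict[str, list[float]]]:
    """Split pairwise scores into per-panel intragroup and intergroup lists.

    Panels are every single group, every pair of groups and, when there
    are more than two groups, all groups combined. A pair of sequences
    contributes to a panel only if both sequences belong to its groups.
    """
    groups = sorted(set(group_of.values()))
    panels: list[tuple[str, ...]] = [(g,) for g in groups]
    panels += list(combinations(groups, 2))
    if len(groups) > 2:
        panels.append(tuple(groups))

    result: dict[str, dict[str, list[float]]] = {}
    for panel in panels:
        intra: list[float] = []
        inter: list[float] = []
        for (id_a, id_b), pid in scores.items():
            group_a, group_b = group_of[id_a], group_of[id_b]
            if group_a in panel and group_b in panel:
                # Same-group pairs are intragroup, the rest intergroup.
                (intra if group_a == group_b else inter).append(pid)
        result["+".join(panel)] = {"intra": intra, "inter": inter}
    return result
```

With three group labels the function returns seven panels, matching the G1, G2, G3, G1+G2, G1+G3, G2+G3 and G1+G2+G3 histograms plotted for each protein.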

Overall results for the four structural proteins were similar (Fig. 1). Frequency 
distributions of identity percentages that were derived from intragroup 1 compar- 
isons formed two main clusters. The rightmost one, which was discontinuous and 
included identity percentages from 78 to 100 (S), 75 to 100 (E), 82 to 100 (M) and 
74 to 100 (N) (G1 in Figs. 1A, B, C and D), included distances between strains
from the same or closely related species. The leftmost cluster was compact and 
showed lower identity percentages ranging from 42 to 52 (S), 23 to 31 (E), 42 to 57 
(M), and 34 to 41 (N). These figures were generated from pairwise comparisons 
between viruses of the two group 1 subsets, one including TGEV, CCoV, and FCoV
(G1-1), and the other consisting of HCoV-229E and PEDV (G1-2). Viruses that
belong to the two different subsets may lack antigenic cross-reactivity [59].

The intragroup 2 comparisons also showed identity percentages that formed 
two clusters (G2 in Figs. 1A, B, C and D). The rightmost one included protein 
identity percentages from 81 to 100 (S), 89 to 100 (E), 92 to 100 (M) and 89 
to 100 (N) that were generated from comparisons between the most closely 
related sequences. The other cluster included percentage scores that ranged from 
65 to 69 (S), 61 to 70 (E), 79 to 85 (M) and 69 to 76 (N). It corresponded 
to comparisons between species of two subgroups, one that includes murine 
coronaviruses (MHV and RtCoV), and the other including HCoV-OC43, BCoV
and HEV. It is evident that the intragroup 2 pairwise sequence differences are
significantly less pronounced than those found for group 1.

Pairwise identity percentages within group 3, which is formed by the two closely
related species IBV and TCoV, were accordingly high and clustered compactly
for three proteins, S (from 82 to 100%), E (83 to 100%) and M (80 to 100%) (G3 in
Figs. 1A, B, and C). However, comparisons of protein N sequences revealed two
clearly separated distance clusters, a rightmost one including high percentage scores
(from 88 to 100%) comparable to those of the other proteins, and a leftmost one with
pairwise identities in a range from 60 to 65% (G3 in Fig. 1D). This second, unique
cluster originated from comparisons involving two species subsets, one prototyped
by the IBV Beaudette strain and the other involving three IBV strains, N1/88, Q3/88 and
V18/91 [60]. The observed differences in the patterns of score distribution among
the four proteins may be rationalized after the genomic characterization of the above
three IBV strains is extended beyond the N protein gene.

Fig. 1. Frequency distributions of pairwise identity percentages of coronavirus structural
proteins. Amino acid sequences of proteins S (A), E (B), M (C) and N (D) were aligned
with CLUSTAL X and T-COFFEE to generate pairwise score matrices including percentages
of identical residues in each pair of sequences. Each protein matrix was produced from
comparisons involving coronaviruses of three genetic groups (G1+G2+G3), and was used
to derive submatrices involving sequences of group 1 (G1), group 2 (G2), group 3 (G3),
groups 1 and 2 (G1+G2), groups 1 and 3 (G1+G3), and groups 2 and 3 (G2+G3).
These matrices of each protein were processed to plot frequency distributions of percentage
scores that were rounded in steps of 2%. Histograms of the intra-group scores were
colored in dark gray and those of the inter-group scores in light gray.

Comparisons involving sequences from two different coronavirus groups produced
distributions of pairwise identity percentages that combined the distributions of
the two groups and included a new cluster containing intergroup pairwise scores
(G1+G2, G1+G3 and G2+G3 in Figs. 1A, B, C and D). For the three different
combinations of two groups, intergroup scores formed the leftmost clusters that
were clearly separated from all intragroup clusters except for the leftmost G1
clusters of E, M and N proteins, with which they either overlapped (E and
N) or formed a continuous supercluster (M).

[Fig. 2 histograms. Panels: POLYMERASE and HELICASE, each with rows G1, G2, G3,
G1+G2, G1+G3, G2+G3 and G1+G2+G3. Axes: Identity %, Frequency.]

Fig. 2. Frequency distributions of pairwise identity percentages of coronavirus replicase
proteins. Amino acid sequences of the RNA-dependent RNA polymerase (A) and the RNA
helicase (B) domains were analyzed in the manner described in the legend to Fig. 1 for
structural proteins of coronaviruses.

Finally, four protein-specific histograms combining all available pairwise 
scores were generated (G1 + G2+G3 in Figs. 1A, B, C and D). Inspection of 
these histograms revealed overlapping ranges of sequence intergroup identities for 
each protein: from 20 to 29% (S), 9 to 30% (E), 21 to 41% (M) and 23 to 35% (N). 

Similar pairwise comparisons involving two replicase protein domains (Fig. 2)
yielded percentage identities higher than those of the structural proteins. This 
difference was expected given that these domains are conserved also outside 
coronaviruses [27, 40]. Thus, pairwise identity percentages among RdRp (Fig. 2A) 
and HEL (Fig. 2B) sequences from different groups ranged from 63 to 71% 
(G1+G2+G3 in Fig. 2A), and from 59 to 67% (G1 + G2+G3 in Fig. 2B), 
respectively. These values bordered the respective intragroup figures that
ranged from 78 to 100% (G1, G2 and G3 in Fig. 2A) and from 77 to 100% 
(G1, G2 and G3 in Fig. 2B), respectively. Because of the limited sampling of 
analyzed sequences, especially for the RNA helicase, the histograms were not as 
well defined as those for the structural proteins. 

Thus, the results of pairwise comparisons involving structural and replicase 
proteins of the Coronavirus genus were consistent and produced percentages of 
sequence identities that were distributed in discontinuous clusters. Intergroup
pairwise scores formed a single cluster at the lowest identity percentages. In contrast,
intragroup pairwise scores were higher, although for some pairs, especially those
from the most diverse group 1, scores were close to or even overlapped with those
from the intergroup comparisons.


Three genetic groups are supported by phylogenetic analysis 
of six proteins of the Coronavirus genus 


The dendrograms involving the six analyzed proteins revealed similar topologies
(Figs. 3 to 8) that were compatible with the current distribution of coronavirus
species among three groups [21, 67]. Each coronavirus group is supported by
high bootstrap values for every protein analyzed. Group 3 is relatively compact,
currently including data from just two species. Each of groups 1 and 2 includes
two subsets of species. In the moderately diverged group 2, a subset including
the MHV and RtCoV species was confidently separated from the other, including
the HCoV-OC43, BCoV and HEV species. In the most diverse and diverged group 1,
TGEV, CCoV and FCoV, forming subgroup G1-1, were separated from the substantially
diverged HCoV-229E and PEDV, forming another subgroup, G1-2. The topologies
of the group 1 sub-trees were found to be very similar for the M and N proteins but
deviated for the other two proteins with respect to the positions of FCoV isolates.
Pairwise similarity of the feline infectious peritonitis virus (FIPV) KU-2 strain
S protein with homologs encoded by other group 1 coronaviruses [47], including
other strains of FIPV, was unusually low (45% of identical residues). Furthermore,
this virus failed a confidence test for inclusion in subgroup G1-1 (Figs. 1
and 3). The FIPV 79-1146 and the feline enteric coronavirus (FECoV) 79-1683
strains were interleaved with TGEV and other coronaviruses in the E protein tree
(Fig. 4), but together with FIPV KU-2 formed a compact cluster in the trees
of M and N proteins (Figs. 5 and 6). These topological anomalies indicate that
recombination may have contributed to the evolution of group 1 viruses, and this
aspect is worth further analysis that is beyond the scope of this paper. Since these
findings are limited to group 1, they do not undermine the classification
of coronaviruses in three main groups.

Fig. 3. Phylogenetic tree of S proteins of coronaviruses. The tree was generated using an
alignment of the S2 part of S protein sequences from 74 different virus isolates by applying
the Neighbor-joining method in the CLUSTAL X v1.82 program. The sequence of the Berne
torovirus (BEV) S protein was used as an outgroup. Bootstrap values higher than 50%
are shown in the main branch nodes. AN, SWISS-PROT/TrEMBL/PIR databases accession
number; SP, abbreviation of the official species name; STR, strain name. In the AN column,
TN449 and SEG85 correspond to sequences that have been obtained in the author's laboratory
and have not been published.

Fig. 4. Phylogenetic tree of E proteins of coronaviruses. The tree was generated using an
alignment of E protein sequences from 44 virus isolates by applying the Neighbor-joining
method in the CLUSTAL X v1.82 program. Bootstrap values higher than 50% are shown in
the main branch nodes. AN, SP, and STR, as in Fig. 3.

Fig. 5. Phylogenetic tree of M proteins of coronaviruses. The tree was generated using an
alignment of M protein sequences from 58 virus isolates by applying the Neighbor-joining
method in the CLUSTAL X v1.82 program. The sequence of the BEV M protein was used
as an outgroup. Bootstrap values higher than 50% are shown in the main branch nodes. AN,
SP, and STR, as in Fig. 3.

Fig. 6. Phylogenetic tree of coronavirus N protein amino acid sequences. The tree was
generated using the sequences of N protein from 66 virus isolates by applying the
Neighbor-joining method in the CLUSTAL X v1.82 program. Bootstrap values higher than 50% are
shown in the main branch nodes. AN, SP, and STR, as in Fig. 3. EqCoV, putative equine
coronavirus.

Fig. 7. Phylogenetic tree of putative RNA-dependent RNA polymerases of coronaviruses.
The tree was generated using an alignment of the RdRp domain from 20 virus isolates by
applying the Neighbor-joining method in the CLUSTAL X v1.82 program. The sequence of
the BEV RdRp was used as an outgroup. Bootstrap values higher than 50% are shown in the
main branch nodes. AN, SP, and STR, as in Fig. 3.

Fig. 8. Phylogenetic tree of RNA helicases of coronaviruses. The tree was generated using
an alignment of the HEL domain from 15 different virus isolates by applying the
Neighbor-joining method in the CLUSTAL X v1.82 program. The sequence of the BEV HEL was used
as an outgroup. Bootstrap values higher than 50% are shown in the main branch nodes. AN,
SP, and STR, as in Fig. 3.


The trees for S (Fig. 3), M (Fig. 5), RdRp (Fig. 7) and HEL (Fig. 8) proteins
were rooted using the torovirus homologs as outgroups. Several topologies are
evident in these trees. The 1st and 2nd groups were clustered in the M and HEL
trees, and the 1st and 3rd groups were clustered in the S tree, while no reliable
intergroup clustering was observed in the RdRp tree. The possible causes of these
variations remain unknown, although they may be due purely to technical reasons
(e.g. small sample size and/or large distances between the outgroup and other
sequences). In this respect, the root position in the S protein tree may be the
least reliable due to the most pronounced divergence of the respective torovirus
sequence. However, these unresolved complexities do not compromise the group
structure of coronaviruses.

Fig. 9. Frequency distributions of pairwise identity percentages of two replicase domains in
the Nidovirales. Amino acid sequences of putative RdRp (A) and the HEL (B) domains were
retrieved from databases and aligned with CLUSTAL X and T-COFFEE to generate pairwise
score matrices including percentages of identical residues in each pair of sequences. These
matrices were processed to plot frequency distributions of percentage scores that were rounded
in steps of 2%. Three groups of scores were colored and they correspond to comparisons:
(i) between coronaviruses from different groups (black), (ii) between nidoviruses from different
families (light gray), and (iii) between coronaviruses and toroviruses (dark gray). A
frequency distribution of pairwise scores between random sequences (white) was also plotted.


Coronaviruses and toroviruses are separated
by a large interfamily-like distance


To evaluate the foundations of the genus structure of the Coronaviridae, comparative
sequence analysis of two replicative domains of corona- and toroviruses,
and of two other nidovirus families, Arteriviridae and Roniviridae, was performed.
The Coronaviridae-wide alignments of the RdRp and HEL domains (see above)
were expanded to include representative sequences of the other two families of
the Nidovirales order.

Fig. 10. Phylogenetic tree of putative RNA-dependent RNA polymerases of Nidovirales. The
RdRp sequences of six coronaviruses (CoV), one torovirus (ToV), four arteriviruses (ArV),
and one ronivirus (RoV) were aligned as described in the text. An unrooted dendrogram was
generated with the Neighbor-joining method using the CLUSTAL X v1.82 program. Bootstrap
values higher than 50% are shown in the main branch nodes. The sequences analyzed and their
accession numbers are as follows: TGEV, strain Purdue, Q9TW06; HCoV-229E, Q9DLN1;
PEDV, strain CV777, Q91AV2; MHV, strain A59, P16342; BCoV, strain Quebec, Q8V6W6;
IBV, strain Beaudette, P26314; BEV, strain P138/72, P18458; PRRSV, Porcine reproductive
and respiratory syndrome virus, strain Lelystad, Q04561; LDV, Lactate dehydrogenase-elevating
virus, strain Plagemann, Q83018; SHFV, Simian hemorrhagic fever virus, strain
LVR 42-0/M6941, P89132; EAV, Equine arteritis virus, strain Bucyrus, P19811; GAV,
Gill-associated virus, Q9WPZ7.

Sequence identities of every sequence pair were extracted from the RdRp 
and HEL alignments. The pairwise identity percentages between sequences of the 
genera Coronavirus and Torovirus (from 22 to 25% and 21 to 25%) were very close 
to or overlapped with the value ranges (from 12 to 22% and 17 to 25%) obtained 
for pairwise interfamily comparisons, but were much smaller than values derived 
from pairwise comparisons for the Coronavirus genus (from 63 to 71% and 59 to 
67%) (Fig. 9). 
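
As an illustration of how pairwise identity percentages and the 2% frequency bins of Fig. 9 can be derived from an alignment, the following Python sketch may be helpful. It is not the CLUSTAL X or T-COFFEE procedure used in this study: the function names, the toy alignment, the decision to skip columns with a gap in either sequence, and the assignment of each score to the lower edge of its 2% bin are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over columns where neither sequence has a gap.

    Assumption: gapped columns are ignored; other denominators are also in use.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        identical += x == y
    return 100.0 * identical / compared if compared else 0.0


def pairwise_scores(alignment):
    """Map every unordered pair of sequence names to its percent identity."""
    return {
        (m, n): percent_identity(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


def binned_frequencies(scores, step=2):
    """Count scores per bin, labelling each bin by its lower edge."""
    return Counter(int(s // step) * step for s in scores)


# Hypothetical toy alignment (not real RdRp or HEL sequences).
toy = {"seqA": "MKT-LV", "seqB": "MKTALV", "seqC": "MRS-LI"}
scores = pairwise_scores(toy)
histogram = binned_frequencies(scores.values())
```

With real data, `toy` would be replaced by the aligned RdRp or HEL sequences, and the histogram could be plotted separately for each class of comparison (intergroup, interfamily, and coronavirus versus torovirus).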

Fig. 11. Phylogenetic tree of RNA helicases of Nidovirales. The analysis was essentially 
identical to that described for Fig. 10, except that the HEL rather than the RdRp domain 
sequences were processed 

The RdRp and HEL alignments were also used to infer neighbor-joining 
dendrograms, which revealed different topologies. The position of the torovirus 
branch in the RdRp dendrogram was not confidently resolved (Fig. 10), and 
roniviruses interleaved between corona- and toroviruses in the HEL dendrogram 
(Fig. 11). Despite the observed differences, which may be due to technical reasons 
(see above), this phylogenetic analysis confirms that toroviruses and coronaviruses 
are separated by a large distance that is comparable to those between established 
nidovirus families. 
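
For readers unfamiliar with the neighbor-joining step, a minimal Python sketch is given here. It implements the textbook algorithm and is not the CLUSTAL X v1.82 implementation used for Figs. 10 and 11. It omits bootstrapping, and the function name, the taxon labels and the toy distance matrix are hypothetical.

```python
def neighbor_joining(names, dist):
    """Build a tree by neighbor joining and return it as a Newick string.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    """
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all other current nodes.
        r = {a: sum(d[a, b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]],
        )
        # Branch lengths from the new internal node to a and to b.
        la = 0.5 * d[a, b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = d[a, b] - la
        new = f"({a}:{la:.3f},{b}:{lb:.3f})"
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    return f"({a}:{d[a, b]:.3f},{b});"


# Hypothetical additive toy distances for five taxa.
toy_names = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(toy_names, toy_dist)
```

In practice the distance matrix would be derived from the aligned sequences (for example from the fraction of differing residues), and the resulting topology would be assessed by bootstrap resampling of alignment columns.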


Discussion 


In this report we performed comparative sequence analysis of six functionally 
different proteins to revise the taxonomy of the Coronaviridae. The analyzed 
sets of proteins varied significantly with respect to the number and diversity of 
sequences, the S set being the most numerous and diverse, and the HEL and 
RdRp sets being the smallest. The results quantified the intervirus relationships 
for the Coronaviridae and, with some minor variations tolerable for taxonomic 
classification, were consistent for all proteins analyzed. 


Three coronavirus genetic groups have diverged 
sufficiently to form separate genera 


Both pairwise scores and NJ-based phylogenetic analyses confirmed three genetic 
groups within the current Coronavirus genus that have been identified previously 
[67] and observed in analyses of other coronavirus sequences [11, 31, 76]. Do these 
groups belong to the same genus or are they prototypes for separate genera? The 
current guidelines for virus taxonomy do not provide firm quantitative sequence 
criteria, but rather leave it to each taxonomy study group to design and validate 
ranks for a family for which it is responsible [79]. As a result, intra- and inter- 
rank pairwise sequence conservation varies tremendously among different virus 
families. For instance, the interfamily conservation among RdRps of viruses of the 
genus Sobemovirus and a Luteoviridae genus is around 45% identical residues, 
and RdRps of the Comoviridae family share 37%, 26% and 24% pairwise identity 
with homologs of the Sequiviridae, Picornaviridae and Caliciviridae families, 
respectively. In contrast, significantly lower numbers characterize the intergenera 
RdRp conservation in the Togaviridae and Flaviviridae [88] (A. Gorbalenya, 
unpublished observation). These figures are correlated with other characteristics 
including genome organization and expression. To revise Coronaviridae taxonomy, 
we decided to follow de facto criteria that discriminate ranks in several 
complex virus families related to the Picornaviridae, with which nidoviruses 
showed distant affinity [26]. 

Inspection of the intergroup pairwise scores obtained for the four structural 
proteins typical of coronaviruses shows that the low end of these numbers (from 
9 to 23%) is in the twilight zone occupied by highly diverged proteins [19]. Accordingly, 
no homologs of the N and E proteins have been found outside coronaviruses, 
and the only (very) distant homologs of the S and M proteins are encoded 
by toroviruses (see below). In another test, we compared the pairwise scores 
obtained for structural proteins of coronaviruses with those of viruses of other densely 
populated families. The inter-group pairwise scores obtained 
for coronaviruses are similar to those calculated for the coat proteins of different 
genera of the Potyviridae family, which have from 18 to 31% identical residues 
[65]. Accordingly, the coronavirus intra-group pairwise scores were closer to 
the intragenus figures, which, for instance, show that amino acid identity in 
structural proteins of the Picornaviridae exceeds 50% [39]. 
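
The comparison of score ranges made above amounts to a simple interval check, sketched here in Python. The ranges are those quoted in the text; the function and variable names are hypothetical, and treating the Picornaviridae intragenus figure as the open interval above 50% is an assumption made for illustration.

```python
def overlap(a, b):
    """Return the shared part of two closed ranges (low, high), or None."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Percent identity ranges quoted in the text.
cov_intergroup_low_end = (9, 23)      # coronavirus structural proteins
potyviridae_intergenus = (18, 31)     # coat proteins of different genera
picornaviridae_intragenus_min = 50    # structural proteins exceed this value

shared = overlap(cov_intergroup_low_end, potyviridae_intergenus)
reaches_intragenus = cov_intergroup_low_end[1] > picornaviridae_intragenus_min
```

Here `shared` is the 18 to 23% band common to both ranges, while `reaches_intragenus` is false because the quoted intergroup scores stay well below the 50% intragenus figure.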

Fig. 12. Genome organizations of the prototype Nidovirales members. The genetic structures 
of prototype members of the Nidovirales have been deduced from previously published data 
[14-16, 21, 26, 44, 71, 73]. A Consensus genetic structure of group 1, 2 and 3 coronaviruses. 
Letters on top of the boxes indicate the genes for the replicase and the structural and 
non-structural proteins. Patterned boxes stand for non-structural protein genes that are common to 
all (dotted boxes) or only some (striped boxes) species in that group. B Probable genetic 
structure of the torovirus BEV. C Genetic structure of the arterivirus EAV. D Genetic 
structure of the ronivirus GAV. Letters above the boxes indicate gene names. L, leader 
sequence; HE, hemagglutinin-esterase; S, spike protein; E, envelope protein; M, membrane 
protein; N, nucleoprotein; I, internal ORF; An, poly-A tail; GS, small glycoprotein; GL, large 
glycoprotein; GP3 and GP4, non-essential envelope glycoproteins; HE*, the BEV HE gene has an 
incomplete sequence compared with the coronavirus HE gene; 2*, sequence homologous 
to the non-structural protein 2 present in group 2 coronaviruses 

Viruses belonging to different genera of the same family (e.g. Picornaviridae) 
also typically have unique genus-specific features in their genome organizations 
[36, 39]. We analyzed coronaviruses from this perspective. All coronaviruses 
maintain a set of essential genes in an invariable order (rep-S-E-M-N), although 
there are other, apparently non-essential genes whose presence and location 
vary and may be group-specific (Fig. 12A). Only group 2 members include the 
hemagglutinin-esterase (HE) gene, and only group 3 viruses have a gene located 
between the M and N genes [6, 44]. There are also group-specific differences in 
the expression patterns of two genes. The replicase has an extra protein released 
from its N-terminus in coronavirus groups 1 and 2 but not in group 3 members 
[91]. Likewise, the spike protein is posttranslationally cleaved into two halves in 
coronavirus groups 2 and 3 but not in every group 1 member [8]. Although 
it is tempting to believe that the above characteristics are indeed group-specific, 
analysis of a more diverse set of coronavirus genomes will be essential to verify 
this perception. 
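
The invariant gene order and the group-specific accessory genes described above can be expressed as a small check, sketched here in Python. The gene lists are simplified illustrations and not complete genome maps: the label "x" for the group 3 gene between M and N is a placeholder, and the function names are hypothetical.

```python
ESSENTIAL_ORDER = ["rep", "S", "E", "M", "N"]


def has_conserved_order(genome, essential=ESSENTIAL_ORDER):
    """True if all essential genes occur in `genome` in the given relative order."""
    positions = [genome.index(g) for g in essential if g in genome]
    return len(positions) == len(essential) and positions == sorted(positions)


def accessory_genes(genome, essential=ESSENTIAL_ORDER):
    """Genes present in `genome` that are not part of the essential set."""
    return [g for g in genome if g not in essential]


# Simplified, illustrative gene orders (5' to 3').
group2_like = ["rep", "2", "HE", "S", "E", "M", "N"]  # genes 2 and HE between rep and S
group3_like = ["rep", "S", "E", "M", "x", "N"]        # "x": placeholder gene between M and N

checks = {
    "group2_like": has_conserved_order(group2_like),
    "group3_like": has_conserved_order(group3_like),
}
```

Applied to a larger and more diverse set of annotated genomes, the same check would show whether the accessory genes returned by `accessory_genes` are indeed restricted to a single group.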

Collectively, our analysis indicates that using criteria that were derived from 
analysis of the taxonomy of other virus families, the current three genetic groups 
of the Coronaviridae must each be elevated to the genus rank. 


Do coronaviruses and toroviruses separate 
into subfamilies or families? 


For the sake of consistency, the taxonomic revision proposed above must be 
accompanied by a similar elevation of the Coronavirus genus, which currently 
unites these three groups, and the Torovirus genus to higher rank(s). There are two 
possible ranks to consider: subfamily or family. The subfamily rank is used in a 
few virus families and one order [48, 80]. To justify its usage in the Coronaviridae, 
the toroviruses and coronaviruses must share (significantly) more characters in 
common than each of them has with the other two Nidovirales families. 
Analysis of virion architecture of nidoviruses revealed different nucleocapsid 
organizations in four major lineages, with toroviruses having toroidal and 
coronaviruses having icosahedral internal virion structures [23, 24, 72]. These 
specifics are correlated with the lack of homologs of the essential or semi-essential 
coronavirus N and E genes [43, 51] in toroviruses [72]. In contrast, and unlike 
arteriviruses and roniviruses, toroviruses do have homologs of the S and M proteins 
of coronaviruses, which are encoded in the same (gene) order [70, 71]. Protein 
conservation is very weak and includes a previously recognized region in the 
M protein [18]. Accordingly, virions of coronaviruses and toroviruses slightly 
resemble each other (Fig. 13A and B) and differ from those of arteriviruses and 
roniviruses (Fig. 13C and D). The most characteristic feature common to virions 
of coronaviruses and toroviruses is the presence of large peplomers that are 
formed by trimers of the S protein protruding from the virion envelope (Fig. 13A and 
B). Toroviruses also have homologs of the non-essential genes 2 and HE, which are unique 
to the group 2 coronaviruses, where they are encoded between the replicase and S 
genes [12]. In toroviruses, these genes are located in other, non-adjacent positions: 
an nsp2 homolog is encoded immediately upstream of the replicase frameshift as 
part of the replicase, and an HE homolog maps between genes M and N (Fig. 12A 
and B) [69]. In the equine torovirus the HE gene is partially truncated [12]. 
Coronaviruses, roniviruses, and arteriviruses, and most probably toroviruses 
[68], have the replicase gene encoded in two overlapping ORFs (ORF 1a and 
ORF 1b), the latter of which is expressed through a frameshifting mechanism (Fig. 12B, 
C, and D). Corona- and toroviruses, compared to roni- and arteriviruses, 
showed the longest collinearity in the replicase 1b region, as reported elsewhere 
[13, 16, 26]. For instance, arteriviruses seem to encode a protein unrelated to the 
putative methyltransferase domain, which is released from the C-terminus of the 
replicase, and may lack a homolog of the putative exonuclease domain, which is 
conserved between HEL and the nidovirus-specific domain in corona- and toroviruses 
[26] (Snijder, Bredenbeek, Dobbe, Thiel, Ziebuhr, Poon, Guan, Rozanov, Spaan 
and Gorbalenya, submitted). This comparative analysis cannot currently be 
extended further to the N-terminus to include the 1a part of the replicase, as the 
corresponding torovirus sequence is not in the public domain. This region, which 
includes more than 4000 amino acid residues in coronaviruses and roniviruses, is 
extremely poorly conserved among established families [13, 18, 26, 68, 89]. Future 
nidovirus-wide analysis of this region may be highly informative for taxonomic 
purposes. 

Fig. 13. Architectures of virus particles of the prototype Nidovirales members. Shown 
are cartoons depicting possible organizations of A Coronavirus [21], B Torovirus [71], C 
Arterivirus [73], and D Ronivirus (modified from a scheme by Peter Walker, Indooroopilly, 
Australia [75]) virions as deduced from available results. MEM, lipid membrane; CS, core 
shell; NC, nucleocapsid; N, nucleoprotein; M, membrane protein with the amino-terminal 
end facing the outside of the virus and the carboxy-terminus located inside the virion; M’, 
membrane protein with both the amino- and the carboxy-terminal ends facing the virus surface; 
E, envelope protein; S, spike protein; HE, hemagglutinin-esterase; GS, small glycoprotein; 
GL, large glycoprotein 

The sequence affinity between toroviruses and coronaviruses summarized 
above was, surprisingly, not evident in our phylogenetic analysis of the two most 
conserved replicase domains of nidoviruses using the NJ algorithm. In contrast, 
in other analyses of the RdRp domain the expected clustering of coronaviruses 
and toroviruses was observed [13, 30]. The latter datasets included outgroup 
sequence(s), although the exact reasons for the observed striking variations in 
tree topology may be more complex. In particular, the very large distances between 
the four major lineages of the nidoviruses and the significant under-representation of toro- 
and ronivirus sequences may have negatively affected the reliability of alignments 
and phylogenetic inference in our analyses. We therefore believe that the nidovirus 
phylogeny must be verified later, when the number and diversity of torovirus and 
ronivirus sequences match those of arteriviruses and coronaviruses. 

Regardless of the actual topology of the nidovirus tree, the evident and relatively 
large distance between RdRps of coronaviruses and toroviruses might alone be 
sufficient to justify separation of these viruses into different families, as has been 
argued by others [82]. Indeed, compared to the 22-25% amino acid residue identity 
between RdRps of coronaviruses and toroviruses, the RdRp conservation among 
a number of viruses that belong to other families is (substantially) higher (see 
above). Likewise, toro- and coronaviruses do not group together with respect to 
the transcription mechanism used to express 3’-located open reading frames. For this 
purpose coronaviruses (and equally arteriviruses) employ discontinuous transcription 
of the genomic RNA [44, 61, 78], while toroviruses appear to rely upon 
both discontinuous and continuous transcription [81]. In this respect, roniviruses 
differ further, as their subgenomic RNAs may be produced exclusively through continuous 
transcription [14]. 

In summary, toroviruses and coronaviruses do have a number of important 
characteristics in common that group them together and separate them from 
arteriviruses and roniviruses. This list of common properties may well be extended 
in future when more sequences become available and characterization of these 
viruses advances. However, until criteria differentiating the family and subfamily 
ranks for viruses in general are clearly formulated, it remains a matter of personal 
preference to choose which characteristics — the numerous common or the few 
unique ones — should weigh more for revising the taxonomy of coronaviruses and 
toroviruses. Either of two possible decisions — to assign the family or subfamily 
rank to coronaviruses and toroviruses — seems to be compatible with the current 
guidelines of the ICTV and either of them may not satisfy everybody in the 
field. 
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